Introduction

This document illustrates the basic principles of performing RNA velocity analysis in R using the velociraptor package, which provides a wrapper around the scVelo python package. The document also contains exercises, indicated by Exercise.

Resources

The following resources provide additional details about the various steps covered in the document:

Data loading

We will use an example data set of pancreatic development obtained from Bastidas-Ponce et al. (2019). The RNAVelocityData.html report provides all the details on how to generate the SingleCellExperiment object that will be used for the analysis. If a smaller data set is desired, for quicker execution time, the pancreas data set can be exchanged for the spermatogenesis data set from Hermann et al. (2018), also generated in RNAVelocityData.html.

suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scater)
})
sce <- readRDS("pancreas.rds")
sce
#> class: SingleCellExperiment 
#> dim: 27998 3696 
#> metadata(6): clusters_coarse_colors clusters_colors ... pca HVGs
#> assays(5): X spliced unspliced counts logcounts
#> rownames(27998): Xkr4 Gm37381 ... Gm20837 Erdr1
#> rowData names(1): highly_variable_genes
#> colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
#>   TTTGTCATCTGTTTGT
#> colData names(5): clusters_coarse clusters S_score G2M_score sizeFactor
#> reducedDimNames(2): X_pca X_umap
#> mainExpName: NULL
#> altExpNames(0):
scater::plotReducedDim(sce, dimred = "X_umap", colour_by = "clusters")

Exercise: Confirm that the example data set contains both spliced and unspliced counts. What are the corresponding assay names? What fraction of the total UMI count is assigned to each of the types?

assayNames(sce)
#> [1] "X"         "spliced"   "unspliced" "counts"    "logcounts"
total <- sum(assay(sce, "spliced")) + sum(assay(sce, "unspliced"))
(frac_spliced <- sum(assay(sce, "spliced"))/total)
#> [1] 0.8370933
(frac_unspliced <- sum(assay(sce, "unspliced"))/total)
#> [1] 0.1629067

RNA velocity estimation

Next, we will perform the RNA velocity estimation. In velociraptor, this is done using the scvelo() function, which provides a wrapper around the main functionality of scVelo. Take a moment to study the documentation of scvelo() - you will see for example that all the modes of scVelo (steady-state, stochastic, dynamical) can be used. In addition, the argument use.theirs can be set to TRUE to perform gene filtering and normalization using scVelo. If set to FALSE (default), these steps will be performed in R, which can be useful in order to increase consistency with the rest of an R-based pipeline. For a confirmation of the ability to reproduce scVelo results with velociraptor, see the CompareVelociraptorScvelo.html report.

Exercise: Run the RNA velocity analysis using the velociraptor::scvelo() function. Use the dynamical mode, and provide only the top 1000 of the estimated highly variable genes (stored in the metadata of the example data). The number and precise selection of genes to use for the velocity estimation can have a considerable impact on the results (we will come back to this later), and the “right” selection will depend on the complexity and characteristics of the data set at hand. Also, use the 50 nearest neighbors to calculate moments. This function takes a few minutes to complete, and you can follow the different steps in the messages printed by the function. In the meanwhile, think about what you expect as the output from the function (e.g., object type, dimensions).

suppressPackageStartupMessages({
    library(velociraptor)
})
velo.out <- velociraptor::scvelo(
    x = sce, 
    subset.row = metadata(sce)$HVGs[1:1000],
    mode = "dynamical",
    scvelo.params = list(moments = list(n_neighbors = 50L))
)

Exercise:

  • What is the class of velo.out?
  • What are the dimensions? Is that what you would expect?
velo.out
#> class: SingleCellExperiment 
#> dim: 1000 3696 
#> metadata(5): neighbors recover_dynamics velocity_params velocity_graph
#>   velocity_graph_neg
#> assays(10): X spliced ... velocity velocity_u
#> rownames(1000): Pyy Iapp ... Rrbp1 Emc10
#> rowData names(18): fit_r2 fit_alpha ... velocity_genes varm
#> colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
#>   TTTGTCATCTGTTTGT
#> colData names(8): velocity_self_transition root_cells ...
#>   velocity_confidence velocity_confidence_transition
#> reducedDimNames(1): X_pca
#> mainExpName: NULL
#> altExpNames(0):

Below we will go through the different steps covered by scvelo() in more detail, and see where the output is stored in the returned object.

Compute neighbors and moments

The first step performed by scvelo() is to estimate moments (mean and uncentered variance) of the abundances for each cell. The moments are calculated across a small set of neighboring cells (in the PCA space), for increased stability (think of it as a kind of smoothing, or imputation). Two layers, Ms and Mu, which are the first order moments (means) for spliced and unspliced abundances, are added to the data object.

## nearest neighbors (note that the nearest neighbor to each cell is the 
## cell itself)
head(metadata(velo.out)$neighbors$indices)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    0  638 1441  889 1099 3051 1909 3332   37  3171   216   909  1821  2497
#> [2,]    1 3300 2952 1555 1934 1896  974  856 2802  1814   368  2341  2853  2772
#> [3,]    2 1215 1830 2755 1254 1573 1689  604 3497  3065  1172  1778   842  3500
#> [4,]    3 1257  222  153 2654 3642 2095 3429  905  2042  1508  3297  1273  2398
#> [5,]    4 2737 1091 2140 1160 3549  481 3059 2580   978  2322   315   503   118
#> [6,]    5 1132 1499  827 1145 2916  505  826 2178  2091   965  3448  3663  2725
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,]  1778  3254  1166  3539  2127  1052  1438  3477  3268    33    83   716
#> [2,]  1807  3372  2233  3567   873  2614  2988   950  2374  1945  1140  2134
#> [3,]  2996   205  1802  1644  2989   478  1554  3247  2480  3627  1775  3624
#> [4,]  2993  2049   901   894  1340  1681   128  1969   144  3466   982   820
#> [5,]  1811   402  3106  1183   195   892   214  2508   297   866  2081   235
#> [6,]  1871   991   328   368  2169  3416  3300  1497  1569  1051  3306  1594
#>      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,]  1645   603  3153   204   698  3374  2776  2023  1558  1580  1769  2733
#> [2,]   342  1667  1054  1460  3236  1132  3002  1754  1402   752   967  2636
#> [3,]  3618  1248  2258   663  3481  1747   817   514   277  2130  3268   323
#> [4,]   563  2778  2183   701   928  1732  1180  3511  1328  2110   194  1851
#> [5,]  1307   925   681  2880  1952   504  1588  1269  2893  2746   609  2639
#> [6,]   933   122  1651   816  2291  1750  3438  3022  1196  2381  2892  3032
#>      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
#> [1,]  2277  3076  1514   140  3619  2064  2083   695  3623  1479  2438  2885
#> [2,]   281  2149   836  2763  3041  1581  3468   488  2070  1766  2091   141
#> [3,]  2017  2406   246  3623   133  3476  2175  2385  2159   733  1068  2426
#> [4,]  1293   997  3605   272  1823  1016  3423   747  2818   471   301  2190
#> [5,]  2024  3233  3206  3193  3367   722  2135  2713  1904   331  3428  3406
#> [6,]  2658  2283  1896  3630   467   836  1115  3256  2763  2233  3567  1531

## Note Ms and Mu assays
assayNames(velo.out)
#>  [1] "X"          "spliced"    "unspliced"  "Ms"         "Mu"        
#>  [6] "fit_t"      "fit_tau"    "fit_tau_"   "velocity"   "velocity_u"

Exercise: Investigate how the number of neighbors impacts the phase plot (the scatterplot of spliced vs unspliced normalized abundances). While there are ways of running individual functions from scVelo directly in the R session (e.g., as you have seen earlier today using reticulate), the easiest way here may be to run velociraptor::scvelo() repeatedly, with different number of neighbors in the moment calculations. Since we are not interested in the velocity estimates at this point, you can use the steady-state mode, which is much faster than the dynamical mode. For each number of neighbors, generate the phase plot for specific genes (e.g., Sulf2 and Xist) using the velociraptor::plotVelocity() function.

out5 <- velociraptor::scvelo(
    x = sce, 
    subset.row = metadata(sce)$HVGs[1:1300],
    mode = "steady_state",
    scvelo.params = list(moments = list(n_neighbors = 5L))
)
out5$clusters <- sce$clusters
plotVelocity(out5, genes = c("Sulf2", "Xist"), genes.per.row = 2, 
             color_by = "clusters", which.plots = "phase")


out30 <- velociraptor::scvelo(
    x = sce, 
    subset.row = metadata(sce)$HVGs[1:1300],
    mode = "steady_state",
    scvelo.params = list(moments = list(n_neighbors = 30L))
)
out30$clusters <- sce$clusters
plotVelocity(out30, genes = c("Sulf2", "Xist"), genes.per.row = 2, 
             color_by = "clusters", which.plots = "phase")


out100 <- velociraptor::scvelo(
    x = sce, 
    subset.row = metadata(sce)$HVGs[1:1300],
    mode = "steady_state",
    scvelo.params = list(moments = list(n_neighbors = 100L))
)
out100$clusters <- sce$clusters
plotVelocity(out100, genes = c("Sulf2", "Xist"), genes.per.row = 2, 
             color_by = "clusters", which.plots = "phase")


out500 <- velociraptor::scvelo(
    x = sce, 
    subset.row = metadata(sce)$HVGs[1:1300],
    mode = "steady_state",
    scvelo.params = list(moments = list(n_neighbors = 500L))
)
out500$clusters <- sce$clusters
plotVelocity(out500, genes = c("Sulf2", "Xist"), genes.per.row = 2, 
             color_by = "clusters", which.plots = "phase")

Recover dynamics

Once the data is preprocessed, the next step fits the model and infers transcription rates, splicing rates and degradation rates for each gene, as well as cell-specific latent times and transcriptional states. scVelo uses an EM algorithm for the estimation, iterating between the E-step (where each observation is assigned a time and a state (induction, repression, or steady state)) and the M-step (where the rate parameters are estimated), as outlined in the figure below from Bergen et al. (2020).

This step is only required for the dynamical model. It adds several columns to rowData(sce) (see https://scvelo.readthedocs.io/DynamicalModeling.html), including:

  • \(R^2\) of the linear fit to the steady-state cells in the phase plot (fit_r2). Note that this can be negative, if the obtained fit is worse than just using a straight line at the mean. This is used to determine which genes are used for the downstream analysis and projection of velocities into a low-dimensional space.
  • transcription rate estimates (fit_alpha)
  • splicing rate estimates (fit_beta)
  • degradation rate estimates (fit_gamma)
  • estimates of switching time points (fit_t_)
  • the likelihood value of the fit (fit_likelihood), averaged across all cells. The likelihood value for a gene and a cell indicates how well the cell is described by the learned phase trajectory.
head(rowData(velo.out)[rowData(velo.out)$velocity_genes, ])
#> DataFrame with 6 rows and 18 columns
#>         fit_r2 fit_alpha  fit_beta fit_gamma    fit_t_ fit_scaling  fit_std_u
#>      <numeric> <numeric> <numeric> <numeric> <numeric>   <numeric>  <numeric>
#> Xist 0.0816643   2.31238 26.163111  0.167319  11.70007   0.0194907  0.0314970
#> Cpe  0.7659820   3.28512  1.993591  0.189969  15.16518   0.1977422  0.7048391
#> Isl1 0.9152534  39.56635 44.674312  1.824373  12.12729   0.0162127  0.2135069
#> Nnat 0.9324625  28.09730  2.580149  0.357863  19.76561   0.0722484  2.2767632
#> Clu  0.4300578   8.67251 34.744106  0.374148   6.60626   0.0174170  0.0828731
#> Ins2 0.9949970  36.26005  0.130333  0.268981  15.37212   1.4386132 47.8505974
#>      fit_std_s fit_likelihood    fit_u0    fit_s0 fit_pval_steady fit_steady_u
#>      <numeric>      <numeric> <numeric> <numeric>       <numeric>    <numeric>
#> Xist   2.79972       0.188480         0         0        0.495581     0.113795
#> Cpe    5.88132       0.601454         0         0        0.487579     1.561235
#> Isl1   6.36506       0.422870         0         0        0.498028     0.635709
#> Nnat  14.18085       0.360123         0         0        0.499801    11.243400
#> Clu    7.24431       0.366780         0         0        0.495890     0.200323
#> Ins2  17.37919       0.304236         0         0        0.499925   312.344015
#>      fit_steady_s fit_variance fit_alignment_scaling velocity_genes
#>         <numeric>    <numeric>             <numeric>      <logical>
#> Xist      8.23593     1.423112              3.057727           TRUE
#> Cpe      11.83513     0.149404              4.185511           TRUE
#> Isl1     15.82331     0.299459              0.667317           TRUE
#> Nnat     46.33812     0.142184              2.414013           TRUE
#> Clu      17.59529     0.325777              2.156530           TRUE
#> Ins2    120.53970     0.364432              3.031129           TRUE
#>                                varm
#>                         <DataFrame>
#> Xist 3311.419:3522.173:4363.661:...
#> Cpe   623.416: 640.433: 486.004:...
#> Isl1  887.846: 408.402: 395.644:...
#> Nnat  280.220: 135.465: 121.017:...
#> Clu   727.547: 680.266: 641.781:...
#> Ins2  280.605: 244.654: 240.003:...

Compute velocities

Once the kinetic rate parameters are estimated, the actual velocities are estimated based on these. This step adds a velocity layer to the data object, and the velocity_genes column in the row data. This column indicates whether the fit for a gene is considered ‘good enough’ for downstream use. Specifically, it requires fit_r2, fit_likelihood and fit_gamma to exceed certain thresholds, fit_scaling to be within a certain range, and that both the unspliced and spliced mean values are nonzero (see the scVelo code for more details).

## Recreate the determination of velocity genes based on the estimated values and 
## check that it agrees with the velocity_genes column returned by scVelo. 
rd <- rowData(velo.out)
perc <- quantile(rd$fit_scaling[!is.na(rd$fit_scaling)], probs = c(0.1, 0.9))
table(crit = rd$fit_r2 > 0.01 & 
          rd$fit_gamma > 0.01 & 
          rd$fit_likelihood > 0.001 & 
          rd$fit_scaling > min(perc[1], 0.03) & 
          rd$fit_scaling < max(perc[2], 3) & 
          rowSums(assay(velo.out, "Ms")) > 0 & 
          rowSums(assay(velo.out, "Mu")) > 0, 
      vel = rd$velocity_genes, useNA = "ifany")
#>        vel
#> crit    FALSE TRUE
#>   FALSE   316    0
#>   TRUE      0  537
#>   <NA>    147    0

Compute the velocity graph

At this point, we have estimated the velocities - these are vectors in a \(K\)-dimensional space, where \(K\) is the number of retained genes. In order to use these velocities for downstream applications, such as estimating the future state of an individual cell or generating low-dimensional visualizations, we next estimate a so called velocity graph. To this end, scVelo calculates cosine similarities between the velocity vector for each cell and the displacement vector from that cell to each other (neighboring) cell: \[\pi_{ij}=cos\angle(\delta_{ij}, v_i),\]where \(\delta_{ij}=s_j-s_i\) is the displacement vector from cell \(i\) to cell \(j\) in gene expression space, and \(v_i\) is the velocity vector of cell \(i\). The cosine similarity takes values between -1 and +1, and a large value indicates that cell \(i\) has a high probability of transitioning towards cell \(j\) (since its velocity vector points towards cell \(j\)). The velocity graph is stored in the uns slot of the data object (and subsequently propagated to the metadata of the returned SingleCellExperiment), and is represented by a sparse \(N\times N\) matrix (where \(N\) is the number of cells). There is also a velocity_graph_neg, which is a matrix of the same size as velocity_graph, containing the negative cosine similarities.

## Velocity graph
dim(metadata(velo.out)$velocity_graph)
#> [1] 3696 3696
metadata(velo.out)$velocity_graph[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgTMatrix"
#>                       
#> [1,] . . 0.2597513 . .
#> [2,] . . .         . .
#> [3,] . . .         . .
#> [4,] . . .         . .
#> [5,] . . .         . .

If you look closely, you note that the number of non-zero elements in each row of the velocity graph (after considering both positive and negative cosine distances) is larger than the number of neighbors. The reason for this is that with default settings, scVelo performs a recursive neighbor finding (with two iterations). This can be controlled by the n_recurse_neighbors arguments to the velocity graph calculations.

Compute latent time and pseudotime

It is often of interest to obtain an ordering of the cells along a trajectory. scVelo provides two different approaches for this: pseudotime and latent time. The velocity pseudotime is obtained via a diffusion random walk on the velocity graph, and measures how many steps (on average) it takes to reach a given cell from one of the root cells. The root cells are obtained from the transition matrix. The latent time is obtained from the transcriptional dynamics fit, by relating gene-specific times (position along the phase curve) to a “universal” latent time, shared across genes.

head(colData(velo.out))
#> DataFrame with 6 rows and 8 columns
#>                  velocity_self_transition  root_cells  end_points
#>                                 <numeric>   <numeric>   <numeric>
#> AAACCTGAGAGGGATA                 0.353678 1.31333e-08 1.08882e-06
#> AAACCTGAGCCTTGAT                 0.556256 1.16838e-01 9.29647e-07
#> AAACCTGAGGCAATTA                 0.603827 9.33038e-09 2.21098e-06
#> AAACCTGCATCATCCC                 0.443273 7.14510e-01 3.93555e-07
#> AAACCTGGTAAGTGGC                 0.181264 9.07078e-08 1.46003e-06
#> AAACCTGGTATTAGCC                 0.515596 1.64557e-01 9.08715e-07
#>                  velocity_pseudotime latent_time velocity_length
#>                            <numeric>   <numeric>       <numeric>
#> AAACCTGAGAGGGATA          0.85800135   0.8851663           10.53
#> AAACCTGAGCCTTGAT          0.03519455   0.0919535           11.51
#> AAACCTGAGGCAATTA          0.92542219   0.9843534            8.85
#> AAACCTGCATCATCCC          0.00694828   0.0289450            9.77
#> AAACCTGGTAAGTGGC          0.63451403   0.6756038           11.81
#> AAACCTGGTATTAGCC          0.01391311   0.0914198            8.22
#>                  velocity_confidence velocity_confidence_transition
#>                            <numeric>                      <numeric>
#> AAACCTGAGAGGGATA            0.867448                      0.3688718
#> AAACCTGAGCCTTGAT            0.891550                      0.1025469
#> AAACCTGAGGCAATTA            0.805561                      0.0212692
#> AAACCTGCATCATCCC            0.844067                      0.3257531
#> AAACCTGGTAAGTGGC            0.928005                      0.5371662
#> AAACCTGGTATTAGCC            0.802261                      0.1929993

Combine input and output

Note that the results of scvelo() are not added to the input SingleCellExperiment object, since the two objects can have different dimensions (in particular note the difference in the number of genes). Thus, in order to use information from the velocity calculations together with information from the original object, we need to copy slots from one object to the other.

Exercise: Copy the reduced dimension representations and cell type annotations from the original SingleCellExperiment object to the output of scvelo(), and use the latter to construct UMAP representations colored by various parameters (e.g., the cell types, velocity pseudotime and latent time).

stopifnot(all(colnames(sce) == colnames(velo.out)))
reducedDims(velo.out) <- reducedDims(sce)
velo.out$clusters <- sce$clusters
cowplot::plot_grid(
    scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "clusters"),
    scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_pseudotime"),
    scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "latent_time"),
    ncol = 1, align = "v"
)

As we already saw above, the rowData and colData of velo.out contain the values of various parameters estimated by scvelo():

head(rowData(velo.out)[rowData(velo.out)$velocity_genes, ])
#> DataFrame with 6 rows and 18 columns
#>         fit_r2 fit_alpha  fit_beta fit_gamma    fit_t_ fit_scaling  fit_std_u
#>      <numeric> <numeric> <numeric> <numeric> <numeric>   <numeric>  <numeric>
#> Xist 0.0816643   2.31238 26.163111  0.167319  11.70007   0.0194907  0.0314970
#> Cpe  0.7659820   3.28512  1.993591  0.189969  15.16518   0.1977422  0.7048391
#> Isl1 0.9152534  39.56635 44.674312  1.824373  12.12729   0.0162127  0.2135069
#> Nnat 0.9324625  28.09730  2.580149  0.357863  19.76561   0.0722484  2.2767632
#> Clu  0.4300578   8.67251 34.744106  0.374148   6.60626   0.0174170  0.0828731
#> Ins2 0.9949970  36.26005  0.130333  0.268981  15.37212   1.4386132 47.8505974
#>      fit_std_s fit_likelihood    fit_u0    fit_s0 fit_pval_steady fit_steady_u
#>      <numeric>      <numeric> <numeric> <numeric>       <numeric>    <numeric>
#> Xist   2.79972       0.188480         0         0        0.495581     0.113795
#> Cpe    5.88132       0.601454         0         0        0.487579     1.561235
#> Isl1   6.36506       0.422870         0         0        0.498028     0.635709
#> Nnat  14.18085       0.360123         0         0        0.499801    11.243400
#> Clu    7.24431       0.366780         0         0        0.495890     0.200323
#> Ins2  17.37919       0.304236         0         0        0.499925   312.344015
#>      fit_steady_s fit_variance fit_alignment_scaling velocity_genes
#>         <numeric>    <numeric>             <numeric>      <logical>
#> Xist      8.23593     1.423112              3.057727           TRUE
#> Cpe      11.83513     0.149404              4.185511           TRUE
#> Isl1     15.82331     0.299459              0.667317           TRUE
#> Nnat     46.33812     0.142184              2.414013           TRUE
#> Clu      17.59529     0.325777              2.156530           TRUE
#> Ins2    120.53970     0.364432              3.031129           TRUE
#>                                varm
#>                         <DataFrame>
#> Xist 3311.419:3522.173:4363.661:...
#> Cpe   623.416: 640.433: 486.004:...
#> Isl1  887.846: 408.402: 395.644:...
#> Nnat  280.220: 135.465: 121.017:...
#> Clu   727.547: 680.266: 641.781:...
#> Ins2  280.605: 244.654: 240.003:...

head(colData(velo.out))
#> DataFrame with 6 rows and 9 columns
#>                  velocity_self_transition  root_cells  end_points
#>                                 <numeric>   <numeric>   <numeric>
#> AAACCTGAGAGGGATA                 0.353678 1.31333e-08 1.08882e-06
#> AAACCTGAGCCTTGAT                 0.556256 1.16838e-01 9.29647e-07
#> AAACCTGAGGCAATTA                 0.603827 9.33038e-09 2.21098e-06
#> AAACCTGCATCATCCC                 0.443273 7.14510e-01 3.93555e-07
#> AAACCTGGTAAGTGGC                 0.181264 9.07078e-08 1.46003e-06
#> AAACCTGGTATTAGCC                 0.515596 1.64557e-01 9.08715e-07
#>                  velocity_pseudotime latent_time velocity_length
#>                            <numeric>   <numeric>       <numeric>
#> AAACCTGAGAGGGATA          0.85800135   0.8851663           10.53
#> AAACCTGAGCCTTGAT          0.03519455   0.0919535           11.51
#> AAACCTGAGGCAATTA          0.92542219   0.9843534            8.85
#> AAACCTGCATCATCCC          0.00694828   0.0289450            9.77
#> AAACCTGGTAAGTGGC          0.63451403   0.6756038           11.81
#> AAACCTGGTATTAGCC          0.01391311   0.0914198            8.22
#>                  velocity_confidence velocity_confidence_transition
#>                            <numeric>                      <numeric>
#> AAACCTGAGAGGGATA            0.867448                      0.3688718
#> AAACCTGAGCCTTGAT            0.891550                      0.1025469
#> AAACCTGAGGCAATTA            0.805561                      0.0212692
#> AAACCTGCATCATCCC            0.844067                      0.3257531
#> AAACCTGGTAAGTGGC            0.928005                      0.5371662
#> AAACCTGGTATTAGCC            0.802261                      0.1929993
#>                       clusters
#>                       <factor>
#> AAACCTGAGAGGGATA Pre-endocrine
#> AAACCTGAGCCTTGAT Ductal       
#> AAACCTGAGGCAATTA Alpha        
#> AAACCTGCATCATCCC Ductal       
#> AAACCTGGTAAGTGGC Ngn3 high EP 
#> AAACCTGGTATTAGCC Ductal

More information about the interpretation of these parameters can be found in the scVelo documentation. Let’s, for example, plot the distribution of transcription, splicing and degradation rates for all velocity genes with fit likelihood larger than 0.1:

suppressPackageStartupMessages({
    library(dplyr)
})
plotdf <- as.data.frame(rowData(velo.out)) %>%
    dplyr::filter(fit_likelihood > 0.1 & velocity_genes)
dim(plotdf)
#> [1] 532  39
cowplot::plot_grid(
    ggplot(plotdf, aes(x = fit_alpha)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    ggplot(plotdf, aes(x = fit_beta)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    ggplot(plotdf, aes(x = fit_gamma)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    nrow = 1
)

Velocity plots

The velociraptor package also provides plotting functions to visualize the RNA velocity results. For example, we can embed the estimated velocities into a provided low-dimensional space. In order to visualize the velocities in a lower-dimensional embedding, we convert the cosine similarities in the velocity graph to transition probabilities of cell-to-cell transitions by applying an exponential kernel: \[\tilde{\pi}_{ij}=\frac{1}{z_i}exp(\frac{\pi_{ij}}{\sigma_i^2}).\] The \(z_i\) are normalization factors and \(\sigma_i\) is an adaptive kernel width parameter. These transition probabilities are used to project the velocities into a low-dimensional embedding. This is achieved by weighting the normalized displacement vectors from a cell to all other cells in the low-dimensional space by the transition probabilities for cell, and taking the resulting weighted average as the low-dimensional velocity vector. More precisely, if \[\tilde{\delta}_{ij}=\frac{\tilde{s}_j-\tilde{s}_i}{\|\tilde{s}_j-\tilde{s}_i\|}\]are the normalized displacement vectors in the low-dimensional embedding, the embedded velocity is estimated by \[\tilde{v}_i=\sum_{j\neq i}\left(\tilde{\pi}_{ij} - \frac{1}{n}\right)\tilde{\delta}_{ij}.\] The \(1/n\) term is included to make the projected velocity zero when the transition probabilities represent a uniform distribution.

suppressPackageStartupMessages({
    library(ggplot2)
})

## PCA
embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "X_pca")[, 1:2], velo.out)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "X_pca")[, 1:2], embedded)
plotReducedDim(velo.out, dimred = "X_pca", colour_by = "velocity_pseudotime") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.1, y = start.2,
                               xend = end.1, yend = end.2),
                 arrow = arrow(length = unit(0.05, "inches")))


## UMAP
embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "X_umap"), velo.out)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "X_umap"), embedded)
plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_pseudotime") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.1, y = start.2,
                               xend = end.1, yend = end.2),
                 arrow = arrow(length = unit(0.05, "inches")))

Next, we will look at individual genes via their phase plots. This is often useful in order to understand how the velocities are affected/supported by specific genes. We can of course plot genes that we are already familiar with and that we know are related to the process of interest. We can also extract genes with particularly strong influence on the velocity results. ‘Driver genes’, which display a strong dynamic behavior, can for example be detected via their high likelihoods in the dynamic model.

Exercise: Use the velociraptor::plotVelocity() function to generate phase plots and UMAP embedding plots colored by velocity and gene expression, respectively, for the six genes with the largest fit likelihoods.

toplh <- rownames(as.data.frame(rowData(velo.out)) %>%
    dplyr::arrange(desc(fit_likelihood)) %>%
    head())
toplh
#> [1] "Pcsk2"   "Ank"     "Rps3"    "Tmem163" "Dcdc2a"  "Btbd17"
plotVelocity(velo.out, use.dimred = "X_umap", genes = toplh[1:6],
             color_by = "clusters")

Finally, we will illustrate two summary statistics returned by scVelo:

  • The length of the velocity vector encodes the speed or rate of differentiation.
  • The ‘velocity confidence’ provides a measure of the coherence of the velocity vector field, that is, how well the velocity vector for a cell correlates with those of its neighbors.
cowplot::plot_grid(
    scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_length"),
    scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_confidence"),
    ncol = 1, align = "v"
)

Modifying the gene selection

In the analysis above we used the most highly variable genes in the data set. However, the velocity analysis can be strongly dependent on the gene selection, especially if there are multiple dynamic processes at play in the data set.

Exercise: Rerun the velocity estimation using only genes found to be associated with the cell cycle, and plot the velocity embedding. For example, Macosko et al. (2015) provide a list of 668 mouse genes with expression patterns that varied along the cell cycle at a false discovery rate of 5% (Table S2 of https://www.cell.com/fulltext/S0092-8674(15)00549-8).

cell_cycle_genes <- read.delim("Macosko_TableS2_mouse.txt", header = FALSE)[, 1]
length(cell_cycle_genes)
#> [1] 668
length(intersect(cell_cycle_genes, rownames(sce)))
#> [1] 658
cell_cycle_genes <- intersect(cell_cycle_genes, rownames(sce))

velocc <- velociraptor::scvelo(sce, subset.row = cell_cycle_genes, 
                               mode = "dynamical")
reducedDims(velocc) <- reducedDims(sce)
velocc$clusters <- sce$clusters
embedded <- velociraptor::embedVelocity(reducedDim(velocc, "X_umap"), velocc)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velocc, "X_umap"), embedded)
plotReducedDim(velocc, "X_umap", colour_by = "clusters") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.1, y = start.2,
                               xend = end.1, yend = end.2),
                 arrow = arrow(length = unit(0.05, "inches")))

Similarly, run the velocity analysis on the highly variable genes that are not also among the cell cycle-associated genes. What do you expect in this case?

velonocc <- velociraptor::scvelo(sce, 
                                 subset.row = setdiff(metadata(sce)$HVGs[1:1000], cell_cycle_genes), 
                                 mode = "dynamical")
reducedDims(velonocc) <- reducedDims(sce)
velonocc$clusters <- sce$clusters
embedded <- velociraptor::embedVelocity(reducedDim(velonocc, "X_umap"), velonocc)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velonocc, "X_umap"), embedded)
plotReducedDim(velonocc, "X_umap", colour_by = "clusters") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.1, y = start.2,
                               xend = end.1, yend = end.2),
                 arrow = arrow(length = unit(0.05, "inches")))

Extracting cluster-specific dynamical genes

While velociraptor wraps the main velocity workflow from scVelo, the latter contains also additional functionality that may be useful. Most of these functions are, however, also accessible from within R, by explicitly calling them in a basilisk environment. Here, we will illustrate this by extracting cluster-wise “velocity genes”, which are genes that show differences in velocity between clusters. See the scVelo documentation for more details.

## Start basilisk process using the environment from velociraptor
velproc <- basilisk::basiliskStart(velociraptor:::velo.env)

## Call a function inside the basilisk process
gr <- basilisk::basiliskRun(proc = velproc, function(sce) { 
    scv <- reticulate::import("scvelo")
    ad <- zellkonverter::SCE2AnnData(sce)   ## Convert SCE to AnnData
    scv$tl$rank_velocity_genes(ad, groupby = 'clusters')   ## Apply function
    df = scv$get_df(ad, 'rank_velocity_genes/names')   ## Extract results
    as.data.frame(df)   ## Return pure R object
}, sce = velo.out)
#> ℹ Using the 'X' assay as the X matrix

## Close the process
basilisk::basiliskStop(velproc)

head(gr)
#>    Ductal   Ngn3 low EP Ngn3 high EP Pre-endocrine    Beta    Alpha   Delta
#> 0    Krt8          Idh2       Vwa5b2         Abcc8    Nnat   Rab27a  Atp1a1
#> 1 St3gal4         Ndrg1         Tecr          Scg5  Sphkap    Map1b Rpl22l1
#> 2    Pld3         Kdm7a       Samd14          Syt7 Slc31a2   Elavl4   Anxa5
#> 3    Rps3        Spint2          Id2      Pcsk2os1 Ppp1r1a   Papss2  Cldn11
#> 4  Spint2 2010107G23Rik       Atp2a3         Pcsk2     Pam     Cpt2  Tmem27
#> 5  Errfi1          Hpgd      Tmem258         Gstz1 Slc30a8 Rnaseh2c    Mapt
#>   Epsilon
#> 0   Prdx4
#> 1  Rab27a
#> 2   Syt13
#> 3   Ush1c
#> 4  Dock11
#> 5    Tmpo
## Genes with different velocity in Beta cells
plotVelocity(velo.out, use.dimred = "X_umap", genes = gr$Beta[1:3],
             color_by = "clusters")

## Genes with different velocity in Ductal cells
plotVelocity(velo.out, use.dimred = "X_umap", genes = gr$Ductal[1:3],
             color_by = "clusters")

References

Bastidas-Ponce, Aimée, Sophie Tritschler, Leander Dony, Katharina Scheibner, Marta Tarquis-Medina, Ciro Salinno, Silvia Schirge, et al. 2019. “Comprehensive Single Cell mRNA Profiling Reveals a Detailed Roadmap for Pancreatic Endocrinogenesis.” Development 146:dev173849.
Bergen, Volker, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. 2020. “Generalizing RNA Velocity to Transient Cell States Through Dynamical Modeling.” Nature Biotechnology 38: 1408–14.
Hermann, Brian P, Keren Cheng, Anukriti Singh, Lorena Roa-De La Cruz, Kazadi N Mutoji, I-Chung Chen, Heidi Gildersleeve, et al. 2018. “The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids.” Cell Rep. 25: 1650–1667.e8.
Macosko, Evan Z, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, et al. 2015. “Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets.” Cell 161 (5): 1202–14.